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Abstract 



Subgradient algorithms for training support vector 
machines have been quite successful for solving large- 
scale and online learning problems. However, they 
have been restricted to linear kernels and strongly con- 
vex formulations. This paper describes efficient sub- 
gradient approaches without such limitations. Our ap- 
proaches make use of randomized low-dimensional 
approximations to nonlinear kernels, and minimiza- 
tion of a reduced primal formulation using an al- 
gorithm based on robust stochastic approximation, 
which do not require strong convexity. Experiments 
illustrate that our approaches produce solutions of 
comparable prediction accuracy with the solutions ac- 
quired from existing SVM solvers, but often in much 
shorter time. We also suggest efficient prediction 
schemes that depend only on the dimension of kernel 
approximation, not on the number of support vectors. 



1 Introduction 

Support vector machines (SVMs) have been highly suc- 
cessful in machine learning and data mining. Derivation, 
implementation, and analysis of efficient solution methods 
for SVMs have been the subject of a great deal of research 
during the past 12 years. We broadly categorize the algo- 
rithms that have been proposed as follows. 

(i) Decomposition methods based on t h e dua l SVM 
formulation, in cludin g SMO (Piatt 'l999), LIB- 



SVM (Fan et al.l l2005h. SVM -Light (Joachims ,1999) . 



GPDT (ISerafini et al.l 2005h . and an online variant 
LASVM ffiordes et al.1 l2005h . The dual formulation 
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(ii) Cutting-plane methods using special primal formu- 
lations to successively add vi olated constrain ts to 
the for mulation. SVM-Perf Jjoachimsl l2006l) and 
OCAS ( Franc and Sonnenburg 200^ handle linear ker- 
nels, while the former ap proach is exten ded to non- 
linear kernels in CPNY ( Joachims et al.l l2009i) and 
CPSP Jjoachims and Yull2009h . 



(iii) Subgradient methods for the primal formulations. Avail- 
able codes include Pegasos (IShalev-Shwartz et al ] l2Q07h 
and SGD (Bottou 20051). These require Unear kernels 
and strong convexity of the SVM formulation. 

Subgradient methods are of particular interest, since they 
are well suited to large-scale and online learning problems. 
Each iteration of these methods consists of simple com- 
putation, usually involving a tiny subset of training data. 
Although a large number of iterations might be required to 
find high accuracy solutions, solutions of moderate accu- 
racy are often enough for learning purposes. Despite such 
benefits, no subgradient algorithms have yet been proposed 
for SVMs with nonlinear kernels, due mainly to the lack 
of explicit representations for feature mappings of interest- 
ing kernels, which are required in the primal formulations. 
This paper aims to provide practical subgradient algorithms 
for training SVMs with nonlinear kernels. 



allows nonlinear kernels to be i ntroduced neatly in to the 
formulation via the kernel trick dBoser et al ] ll992h . 



Unlike Pegasos (IShalev-Shwartz et alj|2007h . we use Vap- 
nik's original SVM formulation without modifying the 
objective to be strongly convex. Our main algorithm 
takes steplengths of size 0(1 /y^) (associated with robust 
stochastic approximation me thods dNemirovski et al.l2009l 
Nemrrovski and Yudin|ll983l) and online convex program- 
ming dZinkevichlbOOsf) ). rather than the 0{l/t) steplength 
scheme in Pegasos. Although the 0(1 /-y/F) schemes have 
slower convergence rate in theory, we see no significant 
performance difference in practice to 0{l/t) methods. As 
we discuss later, optimal choices of a tuning parameter in 
the objective often lead it to be nearly weakly convex, thus 
nearly breaking the assumption that underlies the 0(1 /t) 
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scheme. 

In our nonlinear-kernel formulation, we use low- 
dimensional approximations to the nonlinear feature map- 
pings, whose dimension can be chosen by users. We obtain 
such approximations either by approximating the Gram 
matrix or by constructing subspaces with random bases ap- 
proximating the feature spaces induced by kernels. These 
approximations can be computed and applied to data points 
iteratively, and thus are suited to an online context. Further, 
we suggest an efficient way to make predictions for test 
points using the approximate feature mappings, without re- 
covering the potentially large number of support vectors. 

2 Nonlinear SVMs in the Primal 

In this section we discuss the primal SVM formulation in a 
low-dimensional feature space induced by kernel approxi- 
mation. 

2.1 Structure of the Formulation 



We first analyze the structure of the primal SVM formu- 
lation with nonlinear feature mappings. To unveil the 
details, here we apply the tools of convex analysis rig- 
orou sly, rather than appealing t o the representer theo - 
rem dKimeldorf and Wahbal \l9m as in IChapelld (l2007h . 
where the idea was first introduced. 



Let us consider the training point and label pairs 
{{ti,yi)}'ILi for t,- e R" and e M, and a feature map- 
ping (j) : M" ^- M''. Given a convex loss function £{■) : M — 
R U {00} and A, > 0, the primal SVM problem (for classifi- 
cation) can be stated as follows : 



(PI) min 



-w w 



1 m 
111 



The necessary and sufficient optimahty conditions are 
1 



(la) 



= 1 

1 

m 1- 



0, 



(lb) 
,m. (Ic) 



for some X; G di {yi{w^(if{t,) + b)) , / = 1,2, 
where d£ is the subdifferential of £. 
We now consider the following substitution: 

m 

w = £a;(^(t,-) 

!=1 

(which mimics the form of ((Tab). Motivated by this expres- 
sion, we formulate the following problem 

X 1 "' 

-a^^a+-V^(y,(^,-.a + Z7)), 
2 m 



(2) 



(P2) 



mm 



where e R"^'" is defined by 

^„:=^(t,)^c^(t,), !,; = l,2,...,m, (3) 

and denotes the i-th row of ^. Optimality conditions 
for (P2) are as follows: 



1 m 



(4a) 



(4b) 



for some p; e 3^ a + l,2,...,m. (4c) 

We can now derive the following result via convex analysis, 
showing that the solution of (P2) can be used to derive a 
solution of (PI). This result can be regarded as a special 
case of the representer theorem. 

Proposition 1. Let {a,b) e R™ x R foe a solution of(P2). 
Then if we define w by (|2|, (w,fo) g M'' x R w a solution of 
(PI). 



Proof Since (a, b) solves (P2), the conditions (HI hold, for 
some pi, / = 1,2, . . . ,m. To prove the claim, it suffices to 
show that (w,fo) and % satisfy ([T]i, where w is defined by 
dU and %i = P, for all / = 1,2, . . . ,m. 

By substituting ((Sjl into (|4]i, we have 

m 1 m 

XY^(^{ij)mu)ai + - £ p,y,4(t;)^^(t,) = 0, 



:lP-3', = o, 



■ 1=1 



P/ e 3^ [yt y L c^(t;)^^(t,)a,- + fo j j , / = 1 , 2, . . . , m. 
From the first equality above, we have that 



-E(«' + £^P')*(t')+^-o, 



for some ^ e Null ^[(|)(tj)-^]™_j j . Since the two compo 
nents in this sum are orthogonal, we have 



= 



L(a, + ^P.jc^(tO 



which implies that ^ = 0. We can therefore rewrite the op- 
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timality conditions for (P2) as follows: 



i=i 



£(Xa, + ^p,)^(t,) = 0, (5a) 



-EP-y.^O, (5b) 



P,- e de l^, (^^(t,)^ Y.^aM^j)+bjj , / = 1,2, 



(5c) 



By defining w as in (|2l) and setting Xi ~ P; for all i, we see 
that (|5]) is identical to ([1]), as claimed. □ 

While ^ is clearly symmetric positive semidefinite, the 
proof makes no assumption about nonsingularity of this 
matrix, or uniqueness of the solution a of (P2). However, 
(l4a] i suggests that without loss of generality, we can con- 
strain a to have the form 



a,- 



-P„ 



where P, is restricted to d£. (For the hinge loss function 
£{8) max{l - 5,0}, we have p,- G [-1,0].) These re- 
sults clarify the connection between the expansion coeffi- 
ci ent a and t he dua l variable P(= x), which is introduced 
in IChapelld ( l2007h but not fully explicated there. Simi- 



lar arguments for the regression with the e-insensitive loss 
function £'{5) := max{ 1 5 1 — e, 0} leads to 



a,- 



Ail 



where p;.e [-1,1] is in dl'. 



2.2 Reformulation using Approximations 

Consider the original feature mapping (|)° : M" — >^ itf to 
a Hilbert space itf induced by a kernel ^° : K" x R" ^ 
M, where k° satisfies the condi tions of Mercer's Theo- 
rem (iScholkopf and Smola 2001). Suppose that we have 
a low-dimensional approximation (|) : M" ^ K'' of (|)° for 
which 



(6) 



for all inputs s and t of interest. If we construct a matrix 



V e 



$mxd 



for training examples ti,t2,...,tm by defining 



the i-th row as 



we have that 



^:=yy^«^°:=[F(t,-,t,)],-.;=i.2,...,„ 



(7) 



(8) 



Note that 'Pisa positive semidefinite rank-c/ approximation 
to By substituting ^ = VV^ in (P2), we obtain 

min -a^VV^a+-y e(yi(Vi.V^a + b)). (9) 

;=1 



A change of variables 



(10) 



leads to the equivalent formulation 

A, J m 

(PL) min -fy+-Y^e{yiiV,J+b)). 



This problem can be regarded as a linear SVM with trans- 
formed feature vectors Vl G K'', / = 1 , 2, . . . , m. An approx- 
imate solution to (PL) can be obtained with the subgradient 
algorithms discussed later in Section|3] 

Any a G M'" that solves the overdetermined system JTOl l 
will yield a solution of (|9]l. (Note that a satisfying (fTOl l 
need have at most d nonzeros.) In Section 12.41 we will 
discuss an efficient way to make predictions without recov- 
ering a. 

2.3 Approximating the Kernel 

We discuss two techniques for finding V that satisfies dHJ. 
The first uses randomized linear algebra to calculate a low- 
rank approximation to the Gram matrix The second 
approach uses random projections to construct approximate 
feature mappings (j) explicitly. 

2.3.1 Kernel Matrix Approximation 

Our first approach makes use of the Nystrom sampling 
idea (IDrineas and MahonevI 120051) . to find a good approx- 
imation of specified rank d to the m x m matrix in 
(|8]l. In this approach, we specify some integer s with 
Q) <d <s < m, and choose s elements at random from the 
index set { 1 , 2, . . . , m} to form a subset S ■ We then find the 
bestrank-t/ approximation W5 to and its pseudo- 

inverse W^^. We choose V so that 



VV 



5 ' 



(11) 



where CP") 5 denotes the column s ubmatrix of 'P" defined 
by the indices in S ■ The results in iDrineas and Mahonev 



(l2005h indicate that in expectation and with high probabil- 
ity, the rank-t/ approximation obtained by this process has 
an error that can be made as close as we wish to the best 
rank-c/ approximation by choosing s sufficiently large. 

To obtain W^j, we form the eigen-decomposition 
(^°)ss = QDQ^, where Q e E*'''' is orthogonal and D is 
a diagonal matrix with nonincreasing nonnegative diagonal 
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entries. Taking d <d tohe the number of positive diago- 
nals in D, we have that 

(where Q j j denotes the first J columns of Q, and so on). 
The pseudo-inverse is thus 

and the matrix V satisfying (fTTT i is therefore 



(12) 



This approximation method is less expensive than the pre- 
vious approach, requiring only 0{nd) operations for each 
data point (assuming that sampling of each vector V; G R" 
takes 0{n) time). As we observe in Section]?] however, 
this approach tends to give lower prediction accuracy than 
the first approach for a fixed d value. 

2.4 Efficient Prediction 

Given the solution {y,b) of (PL), we now describe how 
the prediction of a new data point t G M" can be made 
efficiently without recovering the support vector coeffi- 
cient a in (P2). The imposed low dimensionality of the 
approximate kernel in our approach can lead to signifi- 
cantly lower cost of prediction, as low as a fraction of 
ii/(no. support vectors) of the cost of an exact-kernel ap- 
proach. 

For the feature mapping approximation of Section 12.3.21 
we can simply use the decision function / suggested im- 
mediately by (PI), that is, /(t) = w^(|)(t) +/?. Using the 
definitions Q, (|7]i, and (fTOl i. we obtain 

m 

/(t) = ^(t)^£a,^(t0 + /7 

i=l 

The time complexity in this case is 0{nd). 

For the kernel matrix approximation approach of Sec- 
The second approach to defining V finds a mapping ^ : tion|233] the decision function w^c^(t) + b cannot be used 



For practical implementation, rather than defining d a pri- 
ori, we can choose a threshold with < e^/ ^ 1, then 
choose d to be the largest integer in l,2,...,i such that 
Ddd > Erf- (In this case, we have d = d.) 

For each sample set 5, this approach requires 0{ns^ + s^) 
operations for the creation and factorization of as- 
suming that the evaluation of each kernel entry takes 0{n) 
time. Since our algorithm only requires a single row of V 
in each iteration, the computation cost of ( fT2] ) can be amor- 
tized over iterations: the cost is 0{sd) per iteration if the 
corresponding row of is available; 0{ns + sd) other- 



2.3.2 Feature Mapping Approximation 



that satisfies 

(f(s),f (t))=E[(^(s),(^(t))], 

where the expectation is over the random variables 
that determine (|). The approximate mapping (|) can 
be c onstructed exphcitly by random projections as fol- 
low dRahimi and Rechtll2008l) . 



directly, as we have no way to evaluate (|)(t) for an arbitrary 
point t. We can however use the approximation (|6]l to note 
that 



[cos(v[t + coi),--- ,cos(v5't + co,/)]^ (13) 



where Vi , . . . , V,/ e M" are i.i.d. samples from a distribution 
with density p{y), and coi , . . . , co^/ G K are from the uniform 
distribution on [0,27t]. The density function p{y) is deter- 
mined by the types of the kernels we want to use. For the 
Gaussian kernel 



r(s,t)=exp(-o|ls-t|l2), 



(14) 



we have 



P(v) 



1 



:exp 



V ; 



(47i:o)''/2 \ 4a 
from the Fourier transformation of k° . 



(^(t)^w + Z7 = £a,-^(t)^^(t,) + fe 

!=1 

m 

« J^a,F(t;,t) + Z7, (15) 

!=1 

SO we can define the function ( fTSl l to be the decision func- 
tion. To evaluate this, we need only compute those kernel 
values A:°(t,-,t) for which a, ^ 0. As noted in Section |Z2l 
we can satisfy ( fTOb by using just d nonzero components of 
a, so ( fTsT i requires only d kernel evaluations. 

If we set a,- = for all components i ^ S, where 5 is the 
sample set from Section l23] and s = d = d,we can compute 
a that approximately satisfies ( fTOl i without performing fur- 
ther matrix factorizations. Denoting the nonzero subvector 
of a by aj, we have V^a = Vja^ ~ J, so from ( fT2b and 
the fact that i^°)ss QDQ^ , we have 



a.s- 



d'^'- -Q 



T 
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-1/2 

That is, as = Q.\ jD^ j-^ jy, which can be computed in 

0{d^) time. Therefore, prediction of a test point will take 
0{d^ + nd) for this approach, including kernel evaluation 
time. 

3 Stochastic Approximation Algorithm 

We describe here a stochastic approximation algorithm for 
solving the linear SVM reformulation (PL). Consider the 
general convex optimization problem 



min f(x), 



(16) 



where / is a convex function and X is a bounded closed 
convex set with the radius Dx defined by 



Dx '■— max lljclh. 



(17) 



We use g(x) to denote a particular subgradient of /(x). By 
convexity of /, we have 

/(x') -/(x) > g(x)^(x' -x), Vx,x' e X, Vg(x) e 3/(x). 
/ is strongly convex when there exists /j > such that 



{x'-xf[g{x')-g{x)] 



> 



/J\\X 



for all x,x' e X, all g{x) e 8/(x), and all g{x') G 3/(x'). 
Note that the objective in (PL ) is strongly convex in Y, bu t 
only convex in b. Pegasos ( Shalev-Shwartz et alj |2007|) 
requires / to be strongly convex in all variables and thus 
modifies the SVM formulation to have this property. The 
approach we describe below is suitable for the original 
SVM formulation. 

3.1 The ASSET Algorithm 

Our algorithm assumes that at any x E X, we have avail- 
able G(x;^), a stochastic subgradient estimate depending 
on random variable ^ that satisfies E[G(x;^)] = g{x) for 
some g{x) £ 3/(x). The norm deviation of the stochastic 
subgradients is measured by Dq defined as follows: 



E[||G(x;^)||2]<d2 



Vxex,^eS. 



(18) 



Iterate Update: At iteration j, the algorithm takes the 
following step: 



x'=n;,(x^-'-Tl,G(x^-';^^)), j=l,2,..., 

where t,^ is a random variable (i.i.d. with the random vari- 
ables used at previous iterations). Tlx is the Euclidean pro- 
jection onto X, and r[j > is a step length. For our prob- 
lem (PL), we have x-' — (y-',/?-'), and is selected to be 
one of the indices { 1 , 2, . . . , m} with equal probability, and 



the subgradient estimate is constructed from the subgradi- 
ent for the ^-' th term in the summation of the empirical loss 
term. Table [T| summarizes the subgradients G(x-'^';^^) for 
classification and regression tasks, with the hinge loss and 
the e-insensitive loss functions respectively. 

Feasible Sets: We define the feasible set X to be the 
Cartesian product of a ball in the y component with an in- 
terval [—B,B] for the b component. The following shows 
the set X for classification, for which the radius of the ball 
is derived using strong duality (?; Theorem 1): 



X 



< 1/VX, \b\<B^ 



for sufficiently large B, resulting in Dx — \f\fk + W-. For 
regression, the following theorem provides a radius for y: 

Theorem 1. For SVM regression using the e- 
insensitive loss function with < e < ||y||oo, where 
y {yi,y2,---,ymY , we have 



lly||2< 



Proof. We can write an equivalent formulation of (PL) as 
follows: 



1 



min -/■y+C^max{|3;,-(y^(l)(t;; 
y,fo 2 " 



-^)|-e,0}, 



1=1 



for C = 1/ (km). The corresponding Lagrange dual formu- 
lation is 

1 m m 

- o E E {7!,^z,Wj-z,i){mM^j)) 

m m 

-eE(z'- + Z'-) + E3'K2;-^<) 

!=1 i=l 



s.t. 



0, 



0<Zi<C, 0<4<C, /=l,2,...,m. 

Let {"fib*) and (z*,z'*) be the optimal solutions of the pri- 
mal and the dual formulations, respectively. Also, from 
the KKT conditions we have y* = UP^iiz'* - z*)(^iti). Re- 
placing this in the optimal dual objective, and using strong 
duality, we have 



< 



-{fff +C^max{b,- ((r)^^(tO+^*)| -6,0} 



= 1 

m 

-eEfe 

(=1 



< 



:(f)V+2(||y||. 



-z*) + Y^yi{t 

!=1 

-e)lklli- 



■Zi, 
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Table 1: Loss functions and their corresponding subgradients for classification and regression tasks. 



Task 


LUoa rUllOLlUll, K. 


Subgradient, G ^ j 


Classification 


max{l -3/(w^(j)(t) + /7),0} 




dj 


, dj = < 


otherwise 


Regression 


maxjly- (w^(j)(t)+/7)| -£,0} 




-XyJ-'+djV(,: 
dj 


, dj=< 


1 if y^j <V^j.yj-'+bj-'-e, 
otherwise. 



Since < z; < C, we have < C and thus ||z|| i < Cm — 
1 /X. Applying this to the above inequality leads to our 
claim. We exclude the case e > ||y||oo, where the optimal 
solution is trivially {y*,b*) — (0,0). □ 

Averaged Iterates: The solution of ( fTSl i is estimated not 
by the iterates but rather by a weighted sum of the fi- 
nal few iterates. Specifically, if we define to be the total 
number of iterates to be used and .^V < to be the point at 
which we start averaging, the final reported solution esti- 
mate would be 



These is no need to store all the iterates x' , t = N,N + 
l,...,N in order to evaluate the average. Instead, a running 
average can be maintained over the last N — N iterations, 
requiring the storage of only a single extra vector 

Estimation of Dg'- The steplength r[j requires knowl- 
edge of the subgradient estimate deviation Dq defined in 
( fTSl l. We use a small random sample of training data in- 
dexed by / 1,2,...,M, at the first iterate {'f,b^), 
and estimate as 



M 



M^dfrn 



1=1 



1). 



We summarize this framework in Algorithm [T| and refer it 
as ASSET. The integer .^V > specifies the iterate at which 
the algorithm starts averaging the iterates, which can be set 
to 1 to average all iterates, to a predetermined maximum it- 
eration number to output the last iterate without averaging, 
or to a number in between. 

3.2 Convergence 



The 
tion 



analysis of r o bust st ochastic approxima- 



(INemirovski et al 



20091 INemirovski and YudinI 
1983h provides theoretical support for the algorithm 
above. Considering Algorithm [T] applied to the general 
formulation ( fTSl l. and denoting the algorithm's output x^'^, 
we have the following result. 



Algorithm 1 ASSET Algorithm 



1: Input^ r = {(ti,yi), . . . ,Jt,n,ym)}, X, positive inte- 
gers and with <N <N, and Dx and Dq satisfy- 
ing dnli and dH; 

2: Set {f,b'^) = (0,0), (y,^) = (0,0), fj = 0; 

3: for./ = l,2,...,A^do 

, m} at random. 



Choose t,-' 



Compute G 



( 


'yJ- 1 " 




bJ-\ 


( 


"y/-l" 




bj-\ 



for V as in ( fT2] i. or 
for (])(•) asinO . 

following Table m 



bJ-' 



9: 
10: 



if ; > A^ then 

{update averaged iterate} 





11 






y- 


b 


fj+Tl; 


b 




bJ 



Tl=ri+ri^. 

end if 
end for 

Define f'^:=y and b^'^ := b. 



Theorem 2. Given the output and optimal function 
value f(x*), Algorithm\l\satisfies 



E[f{x'^')-f{x*)]<C{p) 



DxDg 



N 



where C(p) solely depends on the fraction p e (0,1) /or 
which N = IpNl . 

3.3 Strongly Convex Case 

Suppose that we omit the intercept b from the linear for- 
mulation (PL). Then its objective function /(x) becomes 
strongly convex for all of its variables. In this special case 
we can apply different steplength r[j — l/{Xj) to achieve 
faster convergence in theory. The algorithm remains the 
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same as Algorithm [T| except that averaging is no longer 
needed and a faster convergence rate can be proved - essen- 



tially a rate of 1/j rather than 1 /y7 (see lNemirovski et al. 
(l2009l) :br a general proof): 



Theorem 3. Given the output and optimal function 
value f(x*), Algorithm\l\with r\j — 1 /(A,y') satisfies 



£[/(x^)-/(x*)]<max 




Dl /N. 



Note that when A, w 0, that is, when the strong convexity 
is very weak, the convergence of this approach can be very 
slow unless we have Dq w as well. 

Without the intercept b, the feasible set X is simplified only 
for the y component, and the update steps are changed ac- 
cordingly. The resulting algorith m, we refer is a s ASSET*, 



is the same as Peg asos ( Shalev-Shwartz et al.l 12007 ) and 



SGD (lBottoull2005h . except for our extensions to nonlin- 
ear kernels. 



4 Computational Results 

We implemented our algorithms based on the open-source 
Pegasos codeQ. We refer our algorithms with kernel matrix 
approximation as ASSETm and ASSET^ (for the versions 
that requires convexity and strong convexity, respectively) 
and with feature mapping approximation as ASSET/r and 
ASSET^. In the interests of making direct comparisons 
with other codes, we do not include intercept terms in our 
experiments, since some of the other codes do not allow 
such terms to be used without penalization. 

We run all experiments on load-free 64-bit Linux systems 
with 2.66 GHz processors and 8 GB memory. Kernel cache 
size is set to 1 GB when applicable. All experiments with 
randomness are repeated 50 times unless otherwise speci- 
fied. 

Table |2] summarizes the six binary classification tasks we 
use for the experiment^ The ADULT data set is randomly 
split into training/validation/test sets. In the MNIST data set, 
we obtain a binary problem by classifying th e digits 0-4 
versus 5-9. In CCAT from the RCVl collection d Lewis et aT 



20041) . we use the original test set as the training set, and 
divide the original training set into validation and test sets. 
IJCNN is constructed by a random splitting of the IJCNN 
2001 Challenge data sell In COVTYPE, the binary problem 



'Our code is available at 

[http://pages.cs.wisc.edu/~skiee7asset/ and Pegasos is 
from http : / /mloss . org/ soft ware/ view/ 35/ 

^ADULT, MNIST, CCAT and COVTYPE data sets are from the UCI 



is to classify type 1 against the other forest cover types. 
Finally, MNIST-E is an extended set of MNIST, generated 
with elastic deformation of the orig inal digitfl Table |2] 
also indicates the values of the regularization parameter A, 
and Gaussian kern el parameter g in (fT4l) selected using the 
SVM-Light solver d Joachim si 19991) to maximize the classi- 
fication accuracy on each validation set. (For MNIST-E we 
use the same parameters as in MNIST.) 

For the first five moderate-size tasks, we compare all of our 
algorithms against four publicly available codes. Two of 



these are the cutting-plane methods CPNY (Joa chims et al. 
[2009) and CPSP (Joachims and Yu 2009) that are im- 
plemented in the version 3.0 of of SVM-Perf. Both 
search for a solution as a linear combination of approx- 
imate basis functions, where the approximation is based 
on Nystrom sampling (CPNY) or on constructing opti- 
mal bases ( CPSP). Th e other two comparison codes are 
SVM-Light ( |joachimsl [T999), which solves the dual SVM 
formulati on via a succession of small subproblems, and 
LASVM dBordes et al.ll2005b . which makes a single pass 
over the data, selecting pairs of examples to opti mize with 
the S MO algorit hm. The original SVM-Perf djoachims 
2006) and OCAS d Franc and Sonnenburg|2008l) are not in- 



cluded in the comparison because they cannot handle non- 
linear kernels. 

For the final large-scale task with MNIST-E data set, we 
compare our algorithms using feature mapping approxi- 
mation - ASSET/r and ASSETJi- - to the online algorithm 
LASVM. 

For our codes, the averaging parameter is set to N ^ m — 
100 for all experiments (that is, averaging is performed for 
the final 100 iterates), and the error values are computed 
using the efficient classification schemes of Section l24l 

4.1 Accuracy vs. Approximation Dimension 

The first experiment investigates the effect of kernel ap- 
proximation dimension on classification accuracy. We set 
the dimension parameter s in Section 12.31 to values in the 
range [2, 1024], with the eigenvalue threshold e^/ = 10^'^. 
Note that s is an upper bound on the actual dimension d of 
approximation for ASSET]j^ , but is equal to d in the case 

of ASSET^*\ The CPSP and CPNY have a parameter sim- 
ilar to s (as an upper bound of d); we compared by setting 
that parameter to the same values as for s. 

For the first five moderate-size tasks, we ran our algorithms 
for 1000 epochs (1000m iterations) so that they converged 
to a near-optimal value with small variation among differ- 
ent randomization. We obtained the baseline performance 
of these tasks by running SVM-Light. SVM-Light does not 



Repository, |http : / /archive ■ ics.uci.edu/mr/, 

^http : / /www, csie .ntu .edu ■ tw/- c jlin/libsvmtools/datasets/j^ittp : / /leon .bottou ■ org/papers/loosli-canu-bottou-2006 / 1 



Approximate Stochastic Subgradient Estimation Training for SVMs 



Table 2: Data sets and training parameters. 



Name 


m (train) 


valid/test 


n 


(density) 


A. 


o 


ADULT 


32561 


8140/8141 


123 


(11.2%) 


3.07e-08 


0.001 


MNIST 


58100 


5950/5950 


784 


(19.1%) 


1.72e-07 


0.01 


CCAT 


78127 


11575/11574 


47237 


(1.6%) 


1.28e-06 


1.0 


IJCNN 


113352 


14170/14169 


22 


(56.5%) 


8.82e-08 


1.0 


COVTYPE 


464809 


58102/58101 


54 


(21.7%) 


7.17e-07 


1.0 


MNIST-E 


1000000 


20000/20000 


784 


(25.6%) 


l.OOe-08 


0.01 




(a) ADULT (b) MNIST (c) CCAT 




2 4 6 8 10 2 4 6 8 10 



(d) IJCNN (e) COVTYPE 

Figure 1 : The effect of the approximation dimension to the test error. The x-axis shows the values of s in log scale (base 
2). 



have dimension parameters but can be expected to give the 
best achievable performance by the kernel-approximate al- 
gorithms as s approaches m. 

Figure [U shows the results. Since ASSETm and ASSET^^ 
yield very similar results in all experiments, we do not plot 
ASSET]^. (For the same reason we show only ASSET/r 
without ASSET^.) When the value of a is very small, as in 
Figure [T(a)| of ADULT data set, all codes achieve good clas- 
sification performance for small dimension. In other data 
sets, the chosen values of a are larger and the intrinsic rank 
of the kernel matrix is higher, so classification performance 
continues to improve as s increases. 

Interestingly, ASSET/r (feature mapping approximation) 
seems to require more dimension than ASSETm (kernel 
matrix approximation) to produce similar classification ac- 



curacy. However in practice we can specify larger dimen- 
sion for ASSETf than for ASSETm since the former re- 
quires less computation than the latter For a given dimen- 
sion, the overall performance of ASSET/r is worse than 
other methods, especially in the CCAT experiment. 

The cutting plane method CPSP generally requires lower 
dimension than the others to achieve the same predic- 
tion performance. This is because CPSP spends extra 
time to construct optimal basis functions, whereas the 
other methods depend on random sampling. However, all 
approximate-kernel methods including CPSP suffer con- 
siderably from the restriction in dimension for the COVTYPE 
task. 
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Table 3: Training CPU time (in seconds, h:hours) and test error rate (%) in parentheses. Kernel approximation dimension is 
varied by setting 5 = 512 and s = 1024 for ASSETm, ASSET;^^, CPSP and CPNY. Decomposition methods do not depend 
on 5, so their results are the same in both tables. 





Subgradient Methods 


Cutting -plane 


Decomposition 


.9 = 512 


ASSETm 


asset;^^ 


CPSP 


CPNY 


LASVM 


SVM-Light 


ADULT 

MNIST 

CCAT 

IJCNN 

COVTYPE 


97 (4.0±0.05) 
95 (8.2±0.08) 
87 (1.1 ±0.02) 
697(18.2±0.06) 


\ i. »7 . J. -J—\J. \J\J J 

101 (4.0±0.04) 
99 (8.3±0.06) 
89 (1.1 ±0.02) 

586(18.2±0.07) 


3020n 5 1\ 
550 (2.7) 
800 (5.2) 
727 (0.8) 
1.8h(17.7) 


8 2hn5 \\ 
348 (4.1) 
62 (8.3) 
320 (1.1) 
1842(18.2) 


ion n8 o") 

588 (1.4) 
2616 (4.7) 

288 (0.8) 
38.3h(13.5) 


857d5 \^ 
1323 (1.2) 
3423 (4.7) 
1331 (0.7) 
52.7h(13.8) 


s = 1024 


ASSETm 


asset;^^ 


CPSP 


CPNY 


LASVM 


SVM-Light 


ADULT 

MNIST 

CCAT 

IJCNN 

COVTYPE 


78(15. 1±0.05) 
275 (2.7±0.03) 
265 (7.1 ±0.05) 
307 (0.8±0.02) 
2259(16.5±0.04) 


83(15. 1±0.04) 
275 (2.7±0.02) 
278 (7.1 ±0.04) 
297 (0.8±0.01) 
2064(16.5±0.06) 


3399(15.2) 
1273 (2.0) 
2950 (5.2) 
1649 (0.8) 
4.1h(16.6) 


7.5h(15.2) 
515 (2.7) 
123 (7.2) 
598 (0.8) 

3598(16.5) 


1011(18.0) 
588 (1.4) 

2616 (4.7) 
288 (0.8) 
38.3h(13.5) 


857(15.1) 
1323 (1.2) 
3423 (4.7) 
1331 (0.7) 
52.7h(13.8) 



4.2 Speed of Achieving Similar Test Error 



4.3 Large-Scale Performance 



Here we ran all algorithms other than ours with their de- 
fault stopping criteria. For ASSETm and ASSETm, we 
checked the classification error on the test sets ten times 
per epoch, terminating when the error matched the perfor- 
mance of CPNY. (Since this code uses a similar Nystrom 
approximation of the kernel, it is the one most directly com- 
parable with ours in terms of classification accuracy.) The 
test error was measured using the iterate averaged over the 
100 iterations immediately preceding each checkpoint. 

Results for the first five data sets are shown in Table [3] for 
.9 = 512 and s = 1024. (Note that LASVM and SVM-Light 
do not depend on s and so their results are the same in 
both tables.) Our methods are the fastest in most cases. 
Although the best classification errors among the approx- 
imate codes are obtained by CPSP, the runtimes of CPSP 
are considerably longer than for our methods. In fact, if 
we compare the performance of ASSETm with s = 1024 
and CPSP with s = 512, ASSETm achieves similar test ac- 
curacy to CPSP (except for CCAT) but is faster by a factor 
between two and forty. CPNY requires an abnormally long 
run time on the ADULT data set; we surmise the code may 
be affected by numerical difficulties. 

It is noteworthy that ASSETm shows similar performance 
to ASSET^J^ despite the less impressive theoretical conver- 
gence rate of the former This is because the values of op- 
timal regularization parameter X were near zero in our ex- 
periments, and thus the objective function lost the strong 
convexity condition required for ASSETm to work. We 
observed similar slowdown of Pegasos and SGD when A- 
approaches zero for Unear SVMs. 



We take the final data set MNIST-E and compare the per- 
formance of ASSETf and ASSET^ to the onHne SVM 
code LASVM. (Other algorithms such as CPSP, CPNY, and 
SVM-Light are less suitable for large-scale comparison be- 
cause they operate in batch mode.) For a fair comparison, 
we fed the training samples to the algorithms in the same 
order. 

Figure |2] shows the progress on a single run of our algo- 
rithms, with various approximation dimensions d (which is 
equal to s in this case) in the range [1024, 16384]. Verti- 
cal bars in the graphs indicate the completion of training. 
ASSETf tends to converge faster and shows smaller test 
error values than ASSET^, despite the theoretical slower 
convergence rate of the former With d = 16384, ASSET/? 
and ASSET^ required 7.2 hours to finish with a solution of 
2.7% and 3.5% test error rate, respectively. LASVM pro- 
duced a better solution with only 0.2% test error rate, but it 
required 4.3 days of computation to complete a single pass 
through the same training data. 



5 Conclusion 



We have proposed a stochastic gradient framework for 
training large-scale and online SVMs using efficient ap- 
proximations to nonlinear kernels. Since our approach does 
not require strong convexity of the objective function or 
dual reformulations for kernelization, it can be extended 
easily to other kernel-based learning problems. 
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ASSETp - 






ASSET* " 






Figure 2: Progress of ASSETf and ASSET^ to their com- 
pletion (MNIST-E), in terms of test error rate. 
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